An Integrated Clustering and Classification Approach for the Analysis of Tumor Patient Data

Authors

  • Stephan M. Winkler
  • Michael Affenzeller
  • Herbert Stekel
Abstract

Standard patient parameters, tumor markers, and tumor diagnosis records are used to identify prediction models for tumor markers as well as for cancer diagnoses. In this paper we present a hybrid clustering and classification approach that first identifies data clusters (using standard patient data and tumor markers) and then learns prediction models on the basis of these clusters. The resulting clusters are analyzed and their homogeneity is calculated; the models learned on the basis of these clusters are tested and compared with each other with respect to classification accuracy and variable impacts.

The work described in this paper was done within the Josef Ressel Centre for Heuristic Optimization Heureka! (http://heureka.heuristiclab.com/) sponsored by the Austrian Research Promotion Agency (FFG).

1 An Integrated Clustering and Classification Approach for the Identification of Predictors for Tumor Diagnoses

The overall goal of the research described here is to identify prediction models for tumor markers (TM) and tumor diagnoses. Tumor markers are substances (found in blood and/or body tissues) that can be used as indicators for certain types of cancer ([4], [14]). In previous work ([13], [16]) we have identified classification models that can be used as virtual tumor markers for estimating TM values on the basis of standard blood parameters. Moreover, in [14] and [15] we have published research results on the identification of prediction models for tumor diagnoses. As described in [15], the use of TM prediction models as virtual tumor markers increases the achievable classification accuracy.

The analysis approach proposed here (schematically shown in Figure 1) integrates clustering and classification algorithms: First, the available patient data are clustered; this clustering is done on the one hand only for standard blood data and on the other hand for standard data plus tumor markers.
The clusters of samples identified in this way are analyzed and compared with each other; in particular, we analyze the size of the clusters and the extent to which samples that are assigned to the same cluster on the basis of standard data are also assigned to the same cluster on the basis of standard and tumor marker data. Within the Heureka! research project we have applied several clustering approaches, including k-means clustering and soft k-means clustering ([9], [8]) as well as the identification of Gaussian mixture models using expectation maximization techniques [18]. As simpler models are to be preferred over more complex ones, the quality of a clustering is calculated considering not only its quantization error, but also the number of clusters formed; the Davies-Bouldin index [3] or the Akaike information criterion [2] can be used for this purpose, for example.

Fig. 1. An integrated clustering and classification approach for the analysis of medical data: Data clusters are formed using standard data and optionally also tumor marker data; these clusters are the basis for the identification of classifiers that can be used as predictors for cancer diagnoses.

The clustered data are subsequently (in combination with tumor diagnosis data) used for learning tumor diagnosis predictors; each cluster is used individually for training these models. We use the following two modeling methods for identifying predictors for tumor markers and cancer diagnoses: hybrid modeling using machine learning algorithms and evolutionary algorithms (which optimize feature selection and the modeling algorithms' parameters), and genetic programming. The models identified in this way are analyzed and compared with each other with respect to classification accuracy and variable impacts.
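The idea of training one predictor per cluster can be illustrated with a minimal Python sketch. It is not the modeling method used in the paper: a simple nearest-centroid classifier stands in for the hybrid and genetic programming models, and routing a new sample to the cluster with the nearest center is an assumption made only for this illustration. The names `train_per_cluster` and `predict` are hypothetical.

```python
import math
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def train_per_cluster(samples, cluster_labels, class_labels):
    """Train one stand-in classifier per cluster.

    Returns {cluster_id: {class_label: class_centroid}}; each inner dict is a
    nearest-centroid model fitted only on the samples of that cluster.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for s, c, y in zip(samples, cluster_labels, class_labels):
        grouped[c][y].append(s)
    return {c: {y: centroid(pts) for y, pts in by_class.items()}
            for c, by_class in grouped.items()}


def predict(models, cluster_centers, sample):
    """Route the sample to its nearest cluster (assumption), then classify it
    with that cluster's own model."""
    cid = min(cluster_centers, key=lambda c: dist(sample, cluster_centers[c]))
    class_centroids = models[cid]
    return min(class_centroids, key=lambda y: dist(sample, class_centroids[y]))


# Toy data: two clusters, each containing one sample of class 0 and class 1.
samples = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
cluster_labels = [0, 0, 1, 1]
class_labels = [0, 1, 0, 1]
centers = {0: [0.5, 0.0], 1: [10.5, 0.0]}

models = train_per_cluster(samples, cluster_labels, class_labels)
print(predict(models, centers, [0.1, 0.0]))
print(predict(models, centers, [10.9, 0.0]))
```

Because every cluster gets its own model, the per-cluster classification accuracies can be compared directly, which is the kind of comparison the approach calls for.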
2 Empirical Test Study: Clustering and Classification of Breast Cancer Patient Data

2.1 Data Basis

Data of thousands of patients of the General Hospital (AKH) Linz, Austria, have been analyzed in order to identify mathematical models for cancer diagnoses. We have used a medical database compiled at the central laboratory of AKH: 28 routinely measured standard values of patients are available as well as several tumor markers. In total, information about 20,819 patients is stored in 48,580 samples. Not all values are available in all samples; there are many missing values simply because not all blood values are measured during each examination. Further details about the data set and the applied preprocessing methods can be found in [13] and [14].

Information about cancer diagnoses is also available in the AKH database: If a patient is diagnosed with any kind of cancer, this information is stored in the database. Our goal in the research work described in this paper is to identify estimation models for the presence of breast cancer (BC, cancer class C50 according to the International Statistical Classification of Diseases and Related Health Problems 10th Revision (ICD-10)).

Following the data preprocessing approach described in [13] and [14], we have compiled a data set specific to this kind of tumor: First, blood parameter measurements were joined with diagnosis results; only measurements and diagnoses with a time interval of less than a month were considered. Second, all samples that contain fewer than 15 valid values were removed. Finally, variables with less than 10% valid values were removed from the data set. This procedure results in a specialized data set for the analysis of breast cancer patient data; it contains 706 samples (45.89% from non-diseased patients, forming class 0, and 54.11% from diseased patients, forming class 1) with routinely measured values of patients as well as tumor markers.
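The second and third preprocessing steps (dropping sparse samples, then dropping sparse variables) can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: representing a sample as a list with `None` for missing values and the function names `filter_samples` and `filter_variables` are assumptions. The defaults of 15 valid values and 10% follow the thresholds stated above.

```python
from typing import List, Optional, Tuple

Sample = List[Optional[float]]  # None marks a missing measurement (assumption)


def filter_samples(samples: List[Sample], min_valid: int = 15) -> List[Sample]:
    """Keep only samples that contain at least `min_valid` valid values."""
    return [s for s in samples
            if sum(v is not None for v in s) >= min_valid]


def filter_variables(samples: List[Sample],
                     min_fraction: float = 0.10) -> Tuple[List[Sample], List[int]]:
    """Drop variables (columns) whose share of valid values is below
    `min_fraction`. Returns the reduced samples and the kept column indices."""
    if not samples:
        return [], []
    n = len(samples)
    kept = [j for j in range(len(samples[0]))
            if sum(s[j] is not None for s in samples) / n >= min_fraction]
    return [[s[j] for j in kept] for s in samples], kept


# Toy example with smaller thresholds so that the effect is visible.
raw = [
    [1.0, None, 3.0],
    [None, None, 2.0],   # only one valid value: removed by filter_samples
    [4.0, None, 5.0],
]
rows = filter_samples(raw, min_valid=2)
reduced, kept_columns = filter_variables(rows, min_fraction=0.5)
print(reduced, kept_columns)
```

The order matters: the variable filter is applied to the samples that survive the sample filter, so the share of valid values is computed on the already reduced data.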
This data set is the same as the BC data set used in [14].

2.2 Clustering Results

The data set of patients compiled in this way was clustered using the k-means algorithm ([9], [8]) with varying numbers of clusters k: The cluster centers are initially set at random and then iteratively adapted until the quantization error is minimized; each sample is assigned to the cluster whose center has the minimum distance to the sample (distance is calculated here using the Euclidean distance function). Since the optimal number of clusters is unknown, different values for k have to be tried; and since simpler models are to be preferred over more complex ones, the quality of a clustering is calculated considering not only its quantization error, but also the number of clusters formed. The Davies-Bouldin index [3] is used in this context. Information about the samples' classification (as diseased or not diseased) is not available to the clustering algorithm.

The mean quantization error (MQE) of cluster i is defined as the average distance of its samples to its center ce_i, and the Davies-Bouldin index (DBI) for a complete clustering hypothesis takes into account the compactness of the formed clusters (via their MQE) as well as their distance:

$$MQE_i = \frac{\sum_{s_j \in cluster_i} dist(s_j, ce_i)}{|cluster_i|} \tag{1}$$
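The clustering and its quality measures can be sketched in plain Python. The sketch below is hedged in several ways: initial centers are drawn from the samples themselves (one common way of setting them at random), the MQE follows Equation (1), and the Davies-Bouldin index is computed in its standard textbook form with the MQE as the scatter measure, which is assumed rather than confirmed to match the variant used in the paper. All function names are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def kmeans(samples, k, seed=0, max_iter=100):
    """Plain k-means; returns (centers, labels)."""
    rng = random.Random(seed)
    # Assumption: random initialization by picking k distinct samples.
    centers = [list(s) for s in rng.sample(samples, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every sample to the cluster with the nearest center.
        new_labels = [min(range(k), key=lambda i: dist(s, centers[i]))
                      for s in samples]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Move each center to the mean of its members (skip empty clusters).
        for i in range(k):
            members = [s for s, l in zip(samples, labels) if l == i]
            if members:
                centers[i] = mean(members)
    return centers, labels


def mqe(samples, labels, centers, i):
    """Mean quantization error of cluster i, as in Equation (1)."""
    members = [s for s, l in zip(samples, labels) if l == i]
    if not members:
        return 0.0  # convention for empty clusters (assumption)
    return sum(dist(s, centers[i]) for s in members) / len(members)


def davies_bouldin(samples, labels, centers):
    """Standard Davies-Bouldin index; lower values mean compact,
    well-separated clusters. Requires k >= 2 and distinct centers."""
    k = len(centers)
    errors = [mqe(samples, labels, centers, i) for i in range(k)]
    total = 0.0
    for i in range(k):
        total += max((errors[i] + errors[j]) / dist(centers[i], centers[j])
                     for j in range(k) if j != i)
    return total / k


data = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
centers, labels = kmeans(data, k=2)
dbi = davies_bouldin(data, labels, centers)
print(labels, dbi)
```

To compare different values of k, one would run `kmeans` for each candidate k and prefer the clustering with the lowest index.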

Similar resources

An Integrated DEA and Data Mining Approach for Performance Assessment

This paper presents a data envelopment analysis (DEA) model combined with Bootstrapping to assess performance of one of the Data mining Algorithms. We applied a two-step process for performance productivity analysis of insurance branches within a case study. First, using a DEA model, the study analyzes the productivity of eighteen decision-making units (DMUs). Using a Malmquist index, DEA deter...

An integrated heuristic method based on piecewise regression and cluster analysis for fluctuation data (A case study on health-care: Psoriasis patients)

Trend forecasting and a proper understanding of future changes are necessary for planning in the health-care area. One of the problems of analytic methods is determining the number and location of the breakpoints, especially for fluctuation data. In this area, few studies have been published for the case in which the number and location of the nodes are not specified. In this paper, a clustering-based method is develo...

Asthma Control Level Assessment by Moving from the Current Reactive Care Models into a Preventive Approach based on Fuzzy Clustering and Classification Algorithms

Background and Aim: Asthma is a common and chronic disease of the respiratory tract. The best way to treat asthma is to control it. Experts in this field suggest continuous monitoring of asthma symptoms and adjustment of the self-care plan, together with a preventive treatment program, to achieve the desired control over asthma. Presenting these plans by the physician is set based on the control level in ...

Application of clustering methods in DNA microarrays

Background: DNA microarray technology has paved the way for investigators to study the expression of thousands of genes in a short time. Analysis of this large amount of raw data includes normalization, clustering and classification. The present study surveys the application of clustering techniques in DNA microarray analysis. Materials and methods: We analyzed data from the study by Van't Veer et al. dealing with BRCA1...

Knowledge discovery from patients’ behavior via clustering-classification algorithms based on weighted eRFM and CLV model: An empirical study in public health care services

The rapid growth of information technology (IT) creates motivation and competitive advantages in the health care industry. Nowadays, many hospitals try to build successful customer relationship management (CRM) to recognize target and potential patients, increase patient loyalty and satisfaction, and finally maximize their profitability. Many hospitals have large data warehouses containing customer ...

Publication date: 2013